GEO 3/446 Exercise #1

Bivariate Choropleth Mapping in R using CDC PLACES and SVI Data

Author

C. Scott Smith, PhD AICP (Instructor)

Published

September 13, 2023

Purpose

This exercise uses Centers for Disease Control PLACES and Social Vulnerability Index (SVI) data in RStudio and ArcGIS Pro to create a series of bivariate choropleth maps and related figures and tables to explore select public health and sociodemographic relationships.

Students will save maps and summaries of spatial analyses in a PowerPoint presentation (or, optionally, a document created with Quarto markdown in R) to be submitted via the course web page. Students are welcome to work collectively with classmates to overcome obstacles experienced while completing the required tasks, although all submissions will be unique given that students will choose their own custom study area and variables for mapping.

The sections below step you through the process to create the above deliverable. For some of the more complicated procedures, online videos will be made available on the course D2L under the Exercise #1 content folder.

Step 1. Create new R project, folders, and data processing script

Effective file management is a foundational aspect of geographic information systems (GIS) that contributes to data organization, integrity, collaboration, efficiency, security, scalability, documentation, compliance, version control, and data accessibility. Ignoring file management can lead to data loss, errors, and inefficiencies in GIS projects, whereas conscientious file management enhances the overall efficiency and reliability of GIS workflows.

That said, in RStudio, create a project within a new directory (e.g., GEO336/exercise_01) of your general course folder. Here you will save files related to exercise #1. Create the following folders within the new exercise-specific directory: “scripts” (for storing R code), “layers” (for storing geographic feature layers) and “maps” (for storing ArcGIS Pro projects).

Create data processing script in R

In RStudio, create a new script file for storing data processing code relevant to this exercise. Create the blank script file by first navigating to the “scripts” directory you created in your project working directory and clicking “New Blank File” in the files pane. Name the script something intuitive (e.g., “exercise01.R”).

Activate relevant packages

Next, use the following code to activate needed R packages. If the packages are not yet installed, you can install multiple packages by passing a vector of package names to the “install.packages” function, for example:

install.packages(c("censusapi", "tigris"))

# install packages if necessary
# install.packages(c("censusapi", "tigris"))

# activate packages
library(censusapi) # retrieving census attribute data
library(tigris) # retrieving census geometries
library(sf) # manipulating geometry data
library(dplyr) # data wrangling, pip format
library(tidyverse) # data wrangling
library(biscale) # bivariate mapping
library(corrplot) # correlation plot
library(scales) # used for rescaling data

options(scipen=999, digits = 2) # format output for data tables

Section 2. Select a study area and download boundary files

For this exercise, you will select a custom, large (i.e., having over 500,000 population), urban county located in the contiguous United States to be your study area. Browse the interactive map of US counties below to identify a study area that interests you. Make note of the county’s name and unique “geoid”. The “geoid” combines the unique two-digit state code with a three-digit county code (e.g., 17031, for example, is the geoid for Cook County, Illinois).

Code
# Make interactive map of US counties by population using leaflet package

# assign bins for population thematic map loosely based on quintile breaks
counties_bins <- c(0, 25000, 100000, 500000, 999999, Inf) 

# create palette using colorblind/accessible colors 
counties_pal <- colorBin(
  palette="viridis", 
  domain=us_counties_population_geo$total_pop, 
  na.color="transparent",
  bins = counties_bins,
  reverse = TRUE
  )

# format/generate popup labels for map
labels = sprintf(
  "<strong>%s, %s</strong><br/>
  Population (2021): %s<br/>
  GEOID: %s <br/>
  Quintile group: %s",
  us_counties_population_geo$NAME, 
  us_counties_population_geo$STATE_NAME,
  format(us_counties_population_geo$total_pop, big.mark = ","),
  us_counties_population_geo$GEOID,
  us_counties_population_geo$total_pop_q) %>% 
  lapply(htmltools::HTML)

# create thematic map using leaflet
leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(group = "US counties",
              data=us_counties_population_geo %>% st_transform(crs=4326),
              # fillColor = "orange",
              fillColor = ~counties_pal(total_pop),
              weight = 0.5,
              opacity = 0.5,
              color = "white",
              dashArray = "3",
              fillOpacity = 0.7,
              highlight = highlightOptions(
                weight = 5,
                color = "#666",
                dashArray = "",
                fillOpacity = 1,
                bringToFront = FALSE),
              label = labels,
              labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "3px 8px"),
                textsize = "15px",
                direction = "auto")) %>%
  addLegend("topright", pal = counties_pal, values = us_counties_population_geo$total_pop,
            title = paste0("Population (2021)"),
            opacity = 1) %>%
  addScaleBar(position = "bottomleft") %>%
    addFullscreenControl()
Figure 1: Total Population by US County, 2021

Set geoid and coordinate reference system (CRS) variables

At times it is useful to set variables in your code for objects that have the potential to regularly change across use cases. In this example, you will assign values to the variables for your county’s unique identification code and coordinate reference system code (i.e., county_geoid and crs_code). Different regions of the world may use regional or localized CRSs for specific needs. Having a standard CRS makes it easier to work with global data while still allowing for local customization when necessary.

It takes two steps to identify your selected county’s CRS code. First identify the UTM NAD83 zone in which the county is located (e.g., the UTM zone for Cook County, Illinois is UTM Zone 16N) using the ArcGIS Hub web page. Next, search the UTM Zone using the EPSG web page to return the associated code (e.g., the EPSG code for UTM Zone 16N is 26916). Append the following code to the bottom of your exercise data processing script to set the five-digit county identifier and coordinate reference system (CRS) variables (i.e., county_geoid and crs_code) for your study area.

# replace the "17031" value with the unique geoid for your county
county_geoid = "17031"

# replace the 26916 EPSG code value with the CRS code associated with your county location
crs_code = 26916 

Download county and census tract layers for your study area

The tigris package in R can be used to download US county and other geographic layers managed by the US Census Bureau. Specifically, these and other cartographic boundaries in the census geographic hierarchy are made available via the US Census Bureau’s TIGER/Line program. The below code can be used in R to download all counties in the United States in a standard NAD83 coordinate reference system, EPSG 4269. Use the below code to download the county and the census tract boundaries for your study area.

county_boundary_geom <- counties(state =  substr(county_geoid,1,2),
                           cb=TRUE, 
                           class="sf", 
                           year = 2010) %>%
  mutate(sqmi = CENSUSAREA,
         geoid_county = paste0(STATE,COUNTY)) %>%
  select(geoid_county, sqmi) %>%
  filter(geoid_county == county_geoid) %>%
  st_as_sf() %>%
  st_transform(crs=crs_code)

tract_boundaries_geom <- tracts(state =  substr(county_geoid,1,2), 
                           county = substr(county_geoid,3,5), 
                           cb=TRUE, 
                           class="sf", 
                           year = 2010) %>%
  mutate(sqmi = CENSUSAREA,
         geoid_county = paste0(STATE,COUNTY),
         geoid_tract = paste0(STATE,COUNTY,TRACT)) %>%
  select(geoid_county, geoid_tract, sqmi) %>%
  st_transform(crs=crs_code) %>%
  st_as_sf()

# create a simple plot of the county  
plot(county_boundary_geom['geoid_county'])

# create a simple plot of the census tracts  
plot(tract_boundaries_geom['geoid_tract'])

Section 3. Assemble public health and vulnerability indicators by census tract

Now that you have your custom geographic layers, the next step is to download some meaningful attributes for analysis. In this section you will combine public health and social vulnerability indicators by census tract into a single feature layer suitable for mapping in ArcGIS Pro.

Import, explore and subset public health prevalence data

The Centers for Disease Control (CDC) PLACES data provides health outcome, prevention, risk behavior and health status estimates for small areas across the United States. These data allow local health departments and jurisdictions to better understand the relative burden and geographic distribution of health measures in their areas and assist them in planning public health interventions. The data are available for download on the CDC PLACES web page by county, place (both incorporated municipalities and census designated places), census tract, and ZIP Code Tabulation Area (ZCTAs) across the United States.

For this exercise, we will use census tracts–the most granular scale available in the PLACES dataset–as the unit of analysis. Use the following code to import the PLACES dataset by census tract and data dictionary into R. This is a large file (~ 57MB), so please be patient. The code also subsets the data to your county of interest.

places_tracts_all <- read_csv("https://data.cdc.gov/api/views/yjkw-uj5s/rows.csv?accessType=DOWNLOAD") 

# subset PLACES to study area county using county_geoid variable
places_tracts_county <- places_tracts_all %>%
    filter(CountyFIPS==county_geoid) %>%
  select(-ends_with("Crude95CI"), -Geolocation) %>%
  rename_with(~str_replace(.,'_CrudePrev', ''))

Review and format public health data for your county

In the July 2023 release of the PLACES dataset, 29 of the measures are based on Behavioral Risk Factor Surveillance System (BRFSS) 2021 data and 7 are based on the 2020 survey data. Scroll through the table below and identify four indicators you’d like to explore further and map. Note the “MeasureID” for each of the four indicators.

Code
# download PLACES data dictionary
places_data_dictionary <- read_csv("https://data.cdc.gov/api/views/m35w-spkz/rows.csv?accessType=DOWNLOAD")

# create data dictionary for PLACES measures
# select at least four measures to evaluate in your county
places_data_dictionary %>%
  select(MeasureID, Name = `Measure full name`, Category = `Category name`) %>%
  datatable(
    class = 'cell-border stripe',
    rownames = FALSE,
    options = list(pageLength = 10)
  ) %>%
  formatStyle(columns = c(1, 2, 3),
              fontSize = '75%')
Figure 2: PLACES Data Dictionary

Append the below code to your custom data processing script. Change the MeasureIDs in the “places_measures_list” with the MeasureIDs of the four indicators you’d like to evaluate. Run the code to create an abbreviated table that includes only the unique census tract geoid and values for the selected PLACES measures.

# Customize the four MeasureIDs in this list with those you'd like to evaluate
places_measures_list <- c("CANCER","CHD","STROKE","DIABETES")

# create detailed data table of selected PLACES measures in wide format
places_tracts_county_sub <- places_tracts_county %>%
  select(geoid_tract = TractFIPS,
         total_pop = TotalPopulation,
         starts_with(places_measures_list))

Import and format social vulnerability data

The next step is to add social vulnerability data to the custom geographic layer to allow more analytical options. The CDC’s Social Vulnerability Index (SVI) was initially designed to help public health officials and emergency response planners identify communities that will most likely need support before, during, and after a hazardous event. However, it has proven to be an effective indicator for evaluating other public health phenomena.

Similar to the PLACES dataset, we will use census tracts–the most granular scale available in the SVI dataset–as the unit of analysis. Append the following code into your data processing script to import the SVI dataset by census tract into R. This is a large file (> 200MB), so please be patient. The code also subsets the data to your county of interest.

# Download SVI by census tract
svi_tracts_all <- read_csv("https://data.cdc.gov/api/views/4d8n-kk8a/rows.csv?accessType=DOWNLOAD&bom=true&format=true")

# subset SVI to study area county using county_geoid variable
svi_tracts_county <- svi_tracts_all %>%
  mutate(COUNTY_FIPS = substr(FIPS,1,5)) %>%
  filter(COUNTY_FIPS==county_geoid) %>%
  select(geoid_tract = FIPS,
         starts_with(c("EP_","RPL_"))) %>%
  na_if(-999)

Review and format SVI data

The 2020 release of the SVI ranks census tracts on 15 social factors, including unemployment, minority status, and disability, and further groups them into four related themes. Thus, each tract receives a ranking for each census variable and for each of the four themes, as well as an overall ranking. Scroll through the table below and identify four SVI indicators you’d like to explore further and map. Note the “MeasureID” for each of the four indicators.

Code
# download SVI data dictionary from GitHub
svi_data_dicionary <- read_csv("https://raw.githubusercontent.com/justenvirons/pedagogy/main/GEO346_2023_FallQuarter/Exercise_01/data/svi_datadictionary.csv") %>%
  filter(str_detect(Name, "EP_|RPL_")) %>%
  rename(MeasureID = Name) %>%
  arrange(Theme)

svi_data_dicionary %>%
    datatable(
    class = 'cell-border stripe',
    rownames = FALSE,
    options = list(pageLength = 10)
  ) %>%
  formatStyle(columns = c(1, 2, 3),
              fontSize = '75%')
Figure 3: SVI Data Dictionary

Similar to the PLACES data, append the below code to your custom data processing script. Change the MeasureIDs in the “svi_measures_list” with the MeasureIDs of the four indicators you’d like to evaluate. Run the code. The code will perform an attribute join between the census tract cartographic boundaries, the custom PLACES table, and the custom SVI table to create a comprehensive geographic layer with public health and SVI attributes. Lastly, save the layer as a shapefile to your “layers” directory using below code along with the county boundary.

# Replace the four SVI variable names in this list with those you'd like to evaluate
svi_measures_list <- c("RPL_THEME1","RPL_THEME2","RPL_THEME3","RPL_THEME4")

# create detailed data table of selected SVI factors 
svi_tracts_county_sub <- svi_tracts_county %>%
  select(geoid_tract, contains(svi_measures_list))  %>%
  mutate_at(svi_measures_list, function(x) rescale(x,to = c(0,100))) %>% 
  drop_na()

# join SVI data to PLACES table with census tract geometries
places_svi_tracts_county_sub_geom <- tract_boundaries_geom %>%
  left_join(places_tracts_county_sub, by="geoid_tract") %>% 
  left_join(svi_tracts_county_sub, by = "geoid_tract") %>%
  st_as_sf()

# create test plot using "geoid_tract" variable
plot(places_svi_tracts_county_sub_geom['geoid_tract'])

# save comprehensive census tract and county geographic layers as shapefiles in layers directory
st_write(places_svi_tracts_county_sub_geom, "C:/git_repos/pedagogy/GEO346_2023_FallQuarter/Exercise_01/layers/places_svi_tracts_county_sub_geom.shp", append = FALSE)
st_write(county_boundary_geom, "C:/git_repos/pedagogy/GEO346_2023_FallQuarter/Exercise_01/layers/county_boundary_geom.shp", append = FALSE)

Step 4. Examine bivariate relationships of selected variables

Create a bivariate correlation plot to explore statistical relationships between the PLACES and SVI variables you selected.

Create bivariate correlation plot

A bivariate correlation plot is a graphical representation used to visualize relationships between pairs of variables in a dataset. This type of plot is particularly useful to examine how the SVI and PLACES variables interact statistically and whether there is a correlation between them. Append the below code to your data processing script to create two custom bivariate correlation plot using the variables you selected.

places_svi_tracts_sub <- places_svi_tracts_county_sub_geom %>%
  st_drop_geometry() %>%
  drop_na() %>%
  select(-c(geoid_county,geoid_tract,sqmi,total_pop))

variable_biv_correlations <- cor(places_svi_tracts_sub)

corrplot(variable_biv_correlations, method="number", col=c("black","white"), order="hclust", bg="lightgrey", tl.col = "grey")

corrplot(variable_biv_correlations, col=c("black","white"), order="hclust", bg="lightgrey", tl.col = "grey")

(a) Bivariate Correlation Plot A

(b) Bivariate Correlation Plot B

Figure 4: Bivariate Correlation Plots

Step 5. Create bivariate choropleth maps in ArcGIS Pro

Bivariate or multivariate maps are a type of thematic cartographic representation that displays two or more variables on a single map using a common symbology. The technique can reveal relationships between variables more effectively than a side-by-side comparison of multiple corresponding univariate maps, but also has the potential to be confusing if the symbols and patterns are overly complex and difficult to understand. Follow the video (pending) to create custom maps in ArcGIS Pro.